Blitz Classifiers (v0.01)

Here we will present Blitz Classifiers in Scikit-Learn.

The main idea here is to use a simple approach to choose the algorithm that best fits your data.

Note that the main function of Blitz Classifiers is to simplify the initial choice of algorithm. After that, you, as a Machine Learning Engineer, can choose the algorithm that best solves your problem, considering complexity, scalability and knowledge.

First of all, let's import some useful libraries.



In [1]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
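# sklearn.cross_validation was deprecated in 0.18 and removed in 0.20;
# model_selection provides the same train_test_split function.
from sklearn import model_selection as cross_validation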
from sklearn import metrics
from sklearn.metrics import mean_squared_error









    




This time we'll import the following scikit-learn classifiers:

Random Forest
Gradient Boosting
Extra Trees
AdaBoost
SVC
KNeighbors
Decision Tree
Perceptron
Logistic Regression



In [2]:

    
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import LogisticRegression

Now we'll import a structured dataset in which all columns are numeric.



In [3]:

    
credit = pd.read_csv('https://raw.githubusercontent.com/fclesio/learning-space/master/Datasets/02%20-%20Classification/default_credit_card.csv')

Let's look at our dataset.



In [4]:

    
credit.head()









    Out[4]:

   ID  LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  ...  BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  PAY_AMT2  PAY_AMT3  PAY_AMT4  PAY_AMT5  PAY_AMT6  DEFAULT
0   1      20000    2          2         1   24      2      2     -1     -1  ...          0          0          0         0       689         0         0         0         0        1
1   2     120000    2          2         2   26     -1      2      0      0  ...       3272       3455       3261         0      1000      1000      1000         0      2000        1
2   3      90000    2          2         2   34      0      0      0      0  ...      14331      14948      15549      1518      1500      1000      1000      1000      5000        0
3   4      50000    2          2         1   37      0      0      0      0  ...      28314      28959      29547      2000      2019      1200      1100      1069      1000        0
4   5      50000    1          2         1   57     -1      0     -1      0  ...      20940      19146      19131      2000     36681     10000      9000       689       679        0

5 rows × 25 columns
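
The preview shows only five rows, so a programmatic check of the column types can complement it. Here is a minimal sketch, assuming a hypothetical three-row miniature of the table called sample (values copied from the preview); on the real data the same check would run on credit.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical miniature of the credit table (not the full dataset)
sample = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000],
    "AGE": [24, 26, 34],
    "DEFAULT": [1, 1, 0],
})

# Collect the columns whose dtype is NOT numeric;
# an empty list means every column is numeric.
non_numeric = [col for col in sample.columns if not is_numeric_dtype(sample[col])]
all_numeric = len(non_numeric) == 0

print(non_numeric, all_numeric)
```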

As we can see, we have only numerical attributes. Below, let's see the correlation of each column with our dependent variable (DEFAULT).



In [5]:

    
credit.corr()["DEFAULT"]









    Out[5]:





ID          -0.013952
LIMIT_BAL   -0.153520
SEX         -0.039961
EDUCATION    0.028006
MARRIAGE    -0.024339
AGE          0.013890
PAY_0        0.324794
PAY_2        0.263551
PAY_3        0.235253
PAY_4        0.216614
PAY_5        0.204149
PAY_6        0.186866
BILL_AMT1   -0.019644
BILL_AMT2   -0.014193
BILL_AMT3   -0.014076
BILL_AMT4   -0.010156
BILL_AMT5   -0.006760
BILL_AMT6   -0.005372
PAY_AMT1    -0.072929
PAY_AMT2    -0.058579
PAY_AMT3    -0.056250
PAY_AMT4    -0.056827
PAY_AMT5    -0.055124
PAY_AMT6    -0.053183
DEFAULT      1.000000
Name: DEFAULT, dtype: float64
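
Since both strongly positive and strongly negative correlations are informative, it can help to rank the columns by absolute value. A small sketch, assuming a hand-copied subset of the values printed above (the names corr and ranked are illustrative only):

```python
import pandas as pd

# Subset of the correlations printed in Out[5], copied by hand
corr = pd.Series({
    "LIMIT_BAL": -0.153520,
    "SEX": -0.039961,
    "PAY_0": 0.324794,
    "PAY_2": 0.263551,
    "BILL_AMT1": -0.019644,
    "DEFAULT": 1.000000,
})

# Drop the target itself, take absolute values, sort from strongest to weakest
ranked = corr.drop("DEFAULT").abs().sort_values(ascending=False)
top3 = list(ranked.index[:3])

print(ranked)
```

On the full notebook the same idea would be credit.corr()["DEFAULT"].drop("DEFAULT").abs().sort_values(ascending=False).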

In this part of the code, we'll select the features and the target of our dataset, and then split the data into training and test sets.



In [6]:

    
features = credit.columns[1:24]
target = credit.columns[24:25]
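
The positional slices above skip the ID column (position 0) and take DEFAULT (position 24) as the target. As an alternative sketch, the same selection can be written by column name, which keeps working if the column order changes. The miniature frame sample and the name feature_names are hypothetical:

```python
import pandas as pd

# Hypothetical miniature with the same kind of layout: ID, features, DEFAULT
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "LIMIT_BAL": [20000, 120000, 90000],
    "AGE": [24, 26, 34],
    "DEFAULT": [1, 1, 0],
})

# Everything except the row identifier and the target is a feature
feature_names = [c for c in sample.columns if c not in ("ID", "DEFAULT")]

X = sample[feature_names].values   # 2-D array of feature values
y = sample["DEFAULT"].values       # 1-D array of target values

print(feature_names, X.shape, y.shape)
```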



In [7]:

    
# X_train: independent (feature) variables for the training data set
# Y_train: dependent (outcome) variable for the training data set

# X_test: independent (feature) variables for the test data set
# Y_test: dependent (outcome) variable for the test data set

X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(
    credit[features].values, credit['DEFAULT'].values, test_size=0.2, random_state=0)

Let's see the shape of our datasets.



In [8]:

    
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)









    



(24000, 23)
(6000, 23)
(24000,)
(6000,)
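
The 80/20 split can be reproduced on any array. The sketch below uses train_test_split from sklearn.model_selection and, as an optional variation that the original notebook does not use, passes stratify so both parts keep the same class ratio. The synthetic X and y (22% positives, an arbitrary imbalanced ratio) are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 5 features, 22% positive labels
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 220 + [0] * 780)

# test_size=0.2 keeps 80% for training; stratify=y preserves the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```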

Now, we'll instantiate our classifier objects.



In [9]:

    
rfc = RandomForestClassifier(n_estimators=100, min_samples_leaf=10, random_state=1, n_jobs=2)
gbc = GradientBoostingClassifier()
etc = ExtraTreesClassifier()
abc = AdaBoostClassifier()
svc = SVC()
knc = KNeighborsClassifier()
dtc = DecisionTreeClassifier()
ptc = Perceptron()
lrc = LogisticRegression()
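
As a hedged aside: the classifiers above receive the raw column values, and some of them (SVC, KNeighbors, Perceptron, Logistic Regression) are generally sensitive to feature scale. A common variation, not part of the original notebook, is to standardize the inputs inside a pipeline. A minimal sketch, assuming synthetic data from make_classification and a hypothetical name scaled_lrc:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one column is inflated to mimic a large-magnitude feature
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X[:, 0] *= 10000

# The scaler is fitted on the training data and applied before the classifier
scaled_lrc = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lrc.fit(X, y)

train_accuracy = scaled_lrc.score(X, y)
print(round(train_accuracy, 3))
```

Whether scaling changes the ranking on the credit data is something to verify by running it; this sketch only shows the mechanics.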

With the training set, we'll fit each classifier.



In [10]:

    
rfc.fit(X_train, Y_train)
gbc.fit(X_train, Y_train)
etc.fit(X_train, Y_train)
abc.fit(X_train, Y_train)
svc.fit(X_train, Y_train)
knc.fit(X_train, Y_train)
dtc.fit(X_train, Y_train)
ptc.fit(X_train, Y_train)
lrc.fit(X_train, Y_train)









    Out[10]:





LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
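
The same fit step can be written as a loop over a dictionary, which avoids one line per classifier. This is only a sketch: it uses three of the nine classifiers, synthetic data from make_classification, and illustrative names (classifiers, predicted) that are not in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Map a display name to each unfitted classifier
classifiers = {
    "Random Forests": RandomForestClassifier(n_estimators=20, random_state=1),
    "Decision Trees": DecisionTreeClassifier(random_state=1),
    "KNN": KNeighborsClassifier(),
}

# Fit every classifier and keep its predictions on the same data
predicted = {}
for name, clf in classifiers.items():
    clf.fit(X, y)
    predicted[name] = clf.predict(X)

print({name: preds.shape for name, preds in predicted.items()})
```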

We'll build an object called expected holding the target values of the training set. We'll use it to see how well each model fits the training data and where it makes errors.



In [11]:

    
expected = Y_train

Now we'll use the predict method on our training attributes to build a prediction object for every classifier.



In [12]:

    
predicted_rfc = rfc.predict(X_train)
predicted_gbc = gbc.predict(X_train)
predicted_etc = etc.predict(X_train)
predicted_abc = abc.predict(X_train)
predicted_svc = svc.predict(X_train)
predicted_knc = knc.predict(X_train)
predicted_dtc = dtc.predict(X_train)
predicted_ptc = ptc.predict(X_train)
predicted_lrc = lrc.predict(X_train)

If you'd like to see every classification report, feel free to execute the code below (it will be deprecated in the next version).



In [13]:

    
print(metrics.classification_report(expected, predicted_rfc))
print(metrics.classification_report(expected, predicted_gbc))
print(metrics.classification_report(expected, predicted_etc))
print(metrics.classification_report(expected, predicted_abc))
print(metrics.classification_report(expected, predicted_svc))
print(metrics.classification_report(expected, predicted_knc))
print(metrics.classification_report(expected, predicted_dtc))
print(metrics.classification_report(expected, predicted_ptc))
print(metrics.classification_report(expected, predicted_lrc))









    



             precision    recall  f1-score   support

          0       0.86      0.97      0.91     18661
          1       0.80      0.44      0.57      5339

avg / total       0.84      0.85      0.83     24000

             precision    recall  f1-score   support

          0       0.84      0.95      0.89     18661
          1       0.70      0.38      0.49      5339

avg / total       0.81      0.83      0.80     24000

             precision    recall  f1-score   support

          0       1.00      1.00      1.00     18661
          1       1.00      1.00      1.00      5339

avg / total       1.00      1.00      1.00     24000

             precision    recall  f1-score   support

          0       0.83      0.96      0.89     18661
          1       0.68      0.32      0.43      5339

avg / total       0.80      0.82      0.79     24000

             precision    recall  f1-score   support

          0       0.99      1.00      1.00     18661
          1       1.00      0.98      0.99      5339

avg / total       0.99      0.99      0.99     24000

             precision    recall  f1-score   support

          0       0.83      0.95      0.89     18661
          1       0.67      0.34      0.45      5339

avg / total       0.80      0.82      0.79     24000

             precision    recall  f1-score   support

          0       1.00      1.00      1.00     18661
          1       1.00      1.00      1.00      5339

avg / total       1.00      1.00      1.00     24000

             precision    recall  f1-score   support

          0       0.82      0.44      0.57     18661
          1       0.25      0.67      0.37      5339

avg / total       0.69      0.49      0.52     24000

             precision    recall  f1-score   support

          0       0.78      1.00      0.87     18661
          1       0.00      0.00      0.00      5339

avg / total       0.60      0.78      0.68     24000
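
These reports are computed on the training set, so the perfect 1.00 scores of Extra Trees and Decision Tree mainly show that those models can memorize the data they were fitted on. A minimal sketch of that effect, assuming synthetic noisy data (flip_y=0.2) rather than the credit dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of the labels randomly reassigned (label noise)
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# An unrestricted tree keeps splitting until every training row is classified
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)  # accuracy on data the tree has seen
test_acc = tree.score(X_te, y_te)   # accuracy on held-out data

print(train_acc, test_acc)
```

The gap between the two numbers is why the comparison that matters is the one made on the test set.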

The same applies to the confusion matrix of each classifier.



In [14]:

    
print(metrics.confusion_matrix(expected, predicted_rfc))
print(metrics.confusion_matrix(expected, predicted_gbc))
print(metrics.confusion_matrix(expected, predicted_etc))
print(metrics.confusion_matrix(expected, predicted_abc))
print(metrics.confusion_matrix(expected, predicted_svc))
print(metrics.confusion_matrix(expected, predicted_knc))
print(metrics.confusion_matrix(expected, predicted_dtc))
print(metrics.confusion_matrix(expected, predicted_ptc))
print(metrics.confusion_matrix(expected, predicted_lrc))









    



[[18083   578]
 [ 3006  2333]]
[[17776   885]
 [ 3314  2025]]
[[18660     1]
 [    8  5331]]
[[17884   777]
 [ 3656  1683]]
[[18642    19]
 [  125  5214]]
[[17785   876]
 [ 3542  1797]]
[[18660     1]
 [    8  5331]]
[[ 8136 10525]
 [ 1777  3562]]
[[18659     2]
 [ 5339     0]]
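
Each matrix has the true classes in rows and the predicted classes in columns, so the precision and recall from the classification reports can be recomputed from it. A short worked example using the first matrix printed above (Random Forest on the training set):

```python
import numpy as np

# First confusion matrix printed above: rows = true class, columns = predicted
cm = np.array([[18083,   578],
               [ 3006,  2333]])

# For binary labels, ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)      # of the predicted defaults, how many were real
recall = tp / (tp + fn)         # of the real defaults, how many were found
accuracy = (tp + tn) / cm.sum()

print(round(precision, 2), round(recall, 2), round(accuracy, 2))
```

The rounded precision and recall match the class 1 row of the first classification report (0.80 and 0.44).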

Now we'll predict on our test dataset to see how well our models fit unseen data.



In [15]:

    
predictions_rfc = rfc.predict(X_test)
predictions_gbc = gbc.predict(X_test)
predictions_etc = etc.predict(X_test)
predictions_abc = abc.predict(X_test)
predictions_svc = svc.predict(X_test)
predictions_knc = knc.predict(X_test)
predictions_dtc = dtc.predict(X_test)
predictions_ptc = ptc.predict(X_test)
predictions_lrc = lrc.predict(X_test)

Let's store the Mean Squared Error of each classifier. Because the labels are 0 and 1, the MSE here equals the fraction of misclassified test examples.



In [16]:

    
mse_rfc = mean_squared_error(predictions_rfc, Y_test)
mse_abc = mean_squared_error(predictions_abc, Y_test)
mse_etc = mean_squared_error(predictions_etc, Y_test)
mse_gbc = mean_squared_error(predictions_gbc, Y_test)
mse_svc = mean_squared_error(predictions_svc, Y_test)
mse_knc = mean_squared_error(predictions_knc, Y_test)
mse_dtc = mean_squared_error(predictions_dtc, Y_test)
mse_ptc = mean_squared_error(predictions_ptc, Y_test)
mse_lrc = mean_squared_error(predictions_lrc, Y_test)
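
For 0/1 labels every squared error is either 0 or 1, so the mean squared error is the misclassification rate, that is, 1 minus accuracy. A tiny sketch with made-up labels (y_true and y_pred are illustrative, not taken from the credit data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Five made-up binary labels; two of the five predictions are wrong
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

mse = mean_squared_error(y_true, y_pred)           # mean of squared differences
error_rate = 1 - accuracy_score(y_true, y_pred)    # share of wrong predictions
rmse = np.sqrt(mse)                                # RMSE is the square root of MSE

print(mse, error_rate, rmse)
```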

Now the scores:



In [17]:

    
print('MSE - Random Forests:', round(mse_rfc, 3))
print('MSE - Gradient Boosting:', round(mse_gbc, 3))
print('MSE - Extra Trees:', round(mse_etc, 3))
print('MSE - Ada Boosting:', round(mse_abc, 3))
print('MSE - SVM:', round(mse_svc, 3))
print('MSE - KNN:', round(mse_knc, 3))
print('MSE - Decision Trees:', round(mse_dtc, 3))
print('MSE - Perceptron:', round(mse_ptc, 3))
print('MSE - Logistic Regression:', round(mse_lrc, 3))



MSE - Random Forests: 0.172
MSE - Gradient Boosting: 0.172
MSE - Extra Trees: 0.195
MSE - Ada Boosting: 0.174
MSE - SVM: 0.216
MSE - KNN: 0.238
MSE - Decision Trees: 0.263
MSE - Perceptron: 0.518
MSE - Logistic Regression: 0.216

OK, let's rank our algorithms to see the best one to start our analysis with.



In [18]:

    
algorithms = {'Algorithm': ['Random Forests', 'Gradient Boosting', 'Extra Trees', 'Ada Boosting', 'SVM', 'KNN', 'Decision Trees', 'Perceptron', 'Logistic Regression'],
        'MSE': [round(mse_rfc,4), round(mse_gbc,4), round(mse_etc,4), round(mse_abc,4), round(mse_svc,4), round(mse_knc,4), round(mse_dtc,4), round(mse_ptc,4), round(mse_lrc,4)]}

# Convert to a pandas DataFrame so the results can be sorted
algos = pd.DataFrame(algorithms)

algos.sort_values(by='MSE', ascending=True)









    Out[18]:

             Algorithm     MSE
1    Gradient Boosting  0.1720
0       Random Forests  0.1723
3         Ada Boosting  0.1740
2          Extra Trees  0.1947
4                  SVM  0.2158
8  Logistic Regression  0.2160
5                  KNN  0.2377
6       Decision Trees  0.2632
7           Perceptron  0.5178

As we can see, the Gradient Boosting algorithm shows the best performance with default parameters for this dataset, narrowly ahead of Random Forests (0.1720 vs. 0.1723). We can start our analysis and development based on this algorithm.
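
Because the top results are so close, a ranking from a single train/test split can change with a different random_state. One possible follow-up, which is not part of the original notebook, is to rank by cross-validated error instead. A minimal sketch, assuming synthetic data, three of the classifiers, and illustrative names (candidates, ranking):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit features and target
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "Decision Trees": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(),
}

# 5-fold cross-validation: each model is scored on five different held-out folds
rows = []
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    rows.append({"Algorithm": name, "Error": round(1 - scores.mean(), 4)})

# Lowest average error first, mirroring the MSE ranking above
ranking = pd.DataFrame(rows).sort_values(by="Error", ascending=True)
print(ranking)
```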

There's a lot of work to do, but this is the beginning. Thanks for reading.
